Method for quantitative analysis of complex proteomic data

ABSTRACT

This invention is a novel method for analysis of data that is produced by test equipment. The preferred embodiment is data produced by liquid chromatography tandem mass spectrometry (LC-MS/MS) equipment, using industry standard methods to generate the initial data from the test equipment. The invention is a method for processing of the data to promptly produce accurate, reliable, and meaningful data that can be used for critical decisions. The unique benefit of the method is to correct the multiple measurement and calculation errors that are inherent in the operation of laboratory equipment. Prior methods result in errors based on circumstances that are difficult to control, accuracy-related errors in machine measurements, and fundamental mathematical errors in the data processing software that is used with the laboratory equipment. As an added benefit, this novel method allows comprehensive simultaneous measurement and calculation of correlation of any and all peptide pairs in a single measurement, with the capability to support repeated measurements with changed conditions over time. This novel method allows robust, detailed, and comprehensive measurements of peptide activity and function, which results in substantial improvements over prior methods in accuracy, reliability, and efficiency.

PRIORITY PATENT APPLICATION

This patent application is a continuation patent application drawing priority from U.S. patent application Ser. No. 13/068,026; filed May 2, 2011. The entire disclosure of the referenced patent application is considered part of the disclosure of the present application and is hereby incorporated by reference herein in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

A portion of the research activities involved in the refinement of the methods described herein was supported by U.S. government funding from the National Institutes of Health, listed under NIH Funding Agreement Numbers HG003864 and CA 126764.

COMPUTER PROGRAM LISTING APPENDIX

Computer software is attached in four compact discs, which are two identical sets of Disc One and Disc Two. The contents of the compact Disc One and Disc Two are incorporated by reference as part of this application. Disc One contains one ASCII file which is the instructions, written in Java computer programming language, of the sequence of calculation procedures that are the preferred embodiment for processing of complex data produced by laboratory equipment. Disc Two contains one ASCII file which represents the screened and processed data from typical test results in a useful output format. The source code and data output formats perform under either Windows or Macintosh operating systems. The data shown on Disc Two is an exhibition of results from typical mass spectrometry measurements. This computer software processes ambiguous test results into statistically significant data that is useful for showing the relationships between active elements within a complex system, such as the immune response communications network within an organism. A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND

Field of the Invention

The present invention relates to data processing methods that store, search, retrieve and process cellular and biochemical information efficiently. The invention is a method that offers substantial benefits as contrasted with prior methods, to allow extremely accurate analysis of complex proteomic data produced by test equipment. More specifically, the method uses the data output of mass spectrometry equipment to produce refined measurements of protein functions and to infer protein interactions, including functions over a complex network of protein interactions within and between biological cells. The resulting information identifies and describes the significant functional relationships for each protein within a group of proteins that are components of a larger biological system. The biological system may be a single cell, a group of cells, or an entire organism. This unique method includes the following functions:

(a) accurate calculation of the relative abundance and activity of each protein for each functionally relevant categorical grouping, based on the measurements from laboratory test equipment, gene and protein sequence data from an external database, and the standard software that is used with the mass spectrometry equipment.

(b) multiple screening to reject errors, sorting by specified criteria, and specification of the functional relevance of proteins within a biological system. Based on sequential measurements over time or following a controlled change in conditions, the calculations include the amount of incremental change in the activity of proteins, and the correlation of activity and patterns of activity to identify and measure the functionally relevant classifications, for all possible combinations and permutations between all observed peptides, proteins, and functional categories.

(c) detailed description of the calculated results in a manner that shows the structure of the complex relationships, as a graphic pattern that can be readily understood by the equipment operator.
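By way of illustration only, the correlation of activity between all pairs of observed peptides described in function (b) may be sketched as follows. The sketch assumes that the correlation measure is the Pearson coefficient computed over sequential measurements; the text above does not specify the statistic, and the class and method names (PairwiseCorrelation, pearson, allPairs) are hypothetical and are not taken from the Computer Program Listing Appendix.

```java
// Illustrative sketch only; NOT the code from the Computer Program Listing Appendix.
// Assumption: "correlation of activity" is modeled here as the Pearson coefficient.
final class PairwiseCorrelation {
    private PairwiseCorrelation() {}

    /** Pearson correlation of two equal-length series; NaN if either series is constant. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have equal length of at least 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /**
     * Symmetric matrix of correlations for every pair of rows.
     * Each row holds one peptide's abundance over the sequential measurements.
     * The diagonal is set to 1.0 by convention.
     */
    static double[][] allPairs(double[][] series) {
        int m = series.length;
        double[][] r = new double[m][m];
        for (int i = 0; i < m; i++) {
            r[i][i] = 1.0;
            for (int j = i + 1; j < m; j++) {
                double c = pearson(series[i], series[j]);
                r[i][j] = c;
                r[j][i] = c;
            }
        }
        return r;
    }
}
```

In this sketch, one call to allPairs covers every pair of peptides in a single pass, so the number of computed coefficients grows with the square of the number of peptides.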

Description of the Related Art

As a result of the advances in genomic sequencing technologies, complete genomic sequences have been derived for several species. Before these technological advances, it was not possible to determine the complete sequence of deoxyribonucleic acids (DNA) in an organism and organize the sequence into functional genes. However, in recent years the technology for sequencing genes has advanced so that it is not only feasible to determine complete genomic sequences but also to quantitatively measure the abundance of every expressed gene, based on mRNA levels for every gene found in a cell.

Nevertheless, analyses of gene sequence and abundance do not provide sufficient information to explain the mechanism, functions, and activity of biological processes. Proteins are essential for the control and execution of virtually every biological process. Accurate correlations between gene sequences and protein functions are limited by the degree of similarity between sequences and the availability of prior experimental results that demonstrate correlation or causation for protein function under specified conditions. Genomic data fails to provide correlations between biological processes and protein activities. The state of protein activity in a cell cannot be determined by gene sequence or the expression level of the corresponding mRNA transcript. Therefore, novel methods are required to monitor biological processes in terms of protein function.

Determining the complete sequence of DNA and mRNA for an organism is only a partial solution to the larger issues of how to understand basic biological functions. Advances in genomic research were based in large part on developments in computer technology and insightful software design. Using new computational approaches, researchers were able to generate, store, organize and analyze large amounts of sequencing data. The investigation of protein function at a similar scale also requires advances in computation and software design so that large amounts of complex data can be accurately analyzed. Software designed for the analysis of protein function at a cellular and organismal scale faces different challenges from those considered in genomic research.

The critical unsolved issue is to accurately describe the functions of all proteins that are derived from the genome, in terms of protein activity and functions over time or following controlled changes in conditions. This protein activity must also be understood with regard to the interactions of multiple proteins within a total system. The identity of each protein is based on messenger RNA transcribed from the DNA sequence of a gene. The function of each protein is determined by changing conditions within and around each cell. Accordingly, measurement of protein functions, and description of protein interactions and complex interrelationships, is separate from sequence identity.

Proteomics is the study of protein activity, functions and interactions. The scope of protein interactions depends on the extent of the cellular signaling network. Accordingly, the effects of protein activity may include interactions within a cell, interrelated interactions among an integrated group of cells, or complex interactions within an entire biological entity. With current technology, the critical unsolved issues include accurate measurement of protein function and activity, description of the protein-to-protein interactions, and the effects of selected compounds for modulation of protein activity. Specialized equipment and methods are a practical necessity to approach new challenges in the field of proteomics.

The process of inter- and intracellular signaling involves a complex network of protein interactions that change rapidly in response to different stimuli. Despite the critical importance of protein activity and protein interactions, the accurate measurement of incremental changes in function over time has remained an unresolved issue. The accurate measurement of protein function is made particularly difficult by the frequent modulation of post-translational modifications that significantly alter protein function. Changes to protein function are associated with critical human diseases, such as sepsis, emphysema, and various forms of cancer. To address the mechanistic cause and progression of these diseases, it is essential to measure biological activity in terms of quantitative changes in protein function.

Overview of Proteomics Technology

Proteome analysis is typically based on the separation of complex protein samples by one-dimensional or two-dimensional gel electrophoresis (2DE) and liquid chromatography, followed by identification of the individual protein species (Gygi and Aebersold 1999). Spectrometric techniques and basic computer algorithms have been designed to rapidly identify proteins by matching peptide mass spectra data to protein and translated nucleic acid sequence databases (Eng, McCormack et al. 1994; Yates III, Eng et al. 1995).

The prior art is shown in the listing of patents and relevant technical publications. The failures of prior art are demonstrated by the omissions in prior patents. The methods proposed by prior patents focus on the accurate identification of proteins by amino acid sequence and modifications. For example, prior art recognizes the need to consider the statistical significance of possibly erroneous matches (Sachs, 2005, page 18, lines 49-51) and potential errors caused by incomplete sequence databases, incomplete splicing, protein modifications, or protein polymorphism (Emili, 2005, paragraph 0199). It is recognized that reliance on the standard software contained in an automatic search engine (e.g., Mascot) and a protein database results in a great number of errors, termed pseudo-positive results, but the proposed solutions do not provide a standard method to prevent or correct the identified errors (Oda, 2008, paragraph 2065). The establishment of a custom database to correct errors in a single run (Oda, 2008, paragraphs 2065-2068) merely confounds the problem because the custom database is not shared or verified by other unbiased independent investigators.

Quantitative methods for proteomic research assign measures of abundance to identified proteins. To date, quantitative proteomic methods lag behind methods for the identification of proteins from complex samples. Currently applied quantitative methods fail to provide statistical confidence intervals or to correct for sources of measurement error. For example, the subjective exclusion of outliers from measured data (Sachs, 2005, page 18, lines 54-60) results in data that reflects investigator selection of preferred data, as contrasted to empirical and unbiased measurement. A recent patent describes a method for measurement of protein phosphorylation with mass spectrometry, but no method to correct the underlying software errors (Hunt, 2006, page 6, lines 5-15). Similarly, a recent patent describes methods to provide a baseline for quantitative comparison through internal controls, but no method to screen out erroneous equipment measurements or incorrect software calculations (Aebersold, 2009, page 7, lines 12-22).

Prior methods note that keyword categories are useful to selectively focus on biological functions of interest within a database (Yamashita, 2010, page 8, lines 10-15; James, 2007, paragraph 0013). However, the prior methods do not include quantitative measures of abundance for keyword categories nor do they calculate statistical correlations between all observed categories. Protein sequence similarity alone has also been used to infer functional similarity and molecular interactions (Mallal, 2006, paragraphs 0007-0012). However, this method merely compares selected sequences and does not provide for a quantitative measure of the degree of shared functionality between all possible proteins within and between samples.

Significantly, the widely used standard methods and software for mass spectrometry analyses are based on the original formats and codes designed over a decade ago. These early developments contained fundamental errors. Accordingly, it is not surprising that the current software and associated methods fail to correct critical calculation and measurement errors. Recent studies by several investigators demonstrate that current methods exhibit significant errors in repeatability and reproducibility, so that typical results cannot be reliably reproduced even with the same machine, same sample, and same operator (Tabb, Vega-Montoto et al. 2010). The existing computer software, test protocol, and screening processes have not kept pace with the current need for analytical details that precisely describe protein function within an interaction network.

Accurate analysis of protein functions presents difficult technical obstacles. Protein functions are interrelated as shown by the complex signaling within and among cells. Accordingly, measurement of protein functions over time is a continuing challenge. The required measurements include accurate identification, physical count of abundance, extent of activity, and the extent of interaction between any two proteins. The widely accepted equipment, consisting of liquid chromatography combined with tandem mass spectrometry (LC-MS/MS), is adequate to provide the essential input data. However, substantial improvements in software and computational methods are necessary to correct errors that result from the use of standard but outdated software to analyze data obtained by mass spectrometry.

Current Test Procedures For Mass Spectrometry Equipment

Continuing technical developments have allowed improvements in mass spectrometry equipment and procedures. The following is a typical procedure for the identification of proteins using mass spectrometry. Samples are prepared from cell lysates that contain tens of thousands of distinct protein species. The sample can be simplified by separating proteins based on size using gel electrophoresis and isolating slices of the gel that contain only several hundred proteins per slice. Handling each slice separately allows the identification of a more complete portion of the original sample. Sample proteins are broken up into short segments termed peptides using a proteolytic enzyme. This resulting mixture of peptides is then separated by liquid chromatography (LC).

This separated mixture is injected into the mass spectrometer, which measures the mass/charge ratio (m/z) for ionized peptides. Then, the MS equipment selects individual peptide ions, fragments them using low-energy collisions, and measures the mass of the ion fragments to obtain amino acid sequence information. The observed m/z ratio of the intact and fragmented peptide ions allows inference of the amino acid sequence. External software accomplishes this by matching fragmentation data to sequence databases. Meanwhile, the intensity of the intact peptide ions allows measurement of relative abundance for species that share the same ionization potential. Internal standards have been developed for the purpose of providing a means for accurate quantitation. A typical MS/MS test procedure takes several hours and may produce hundreds of thousands of line items.

Specialized software is required to interpret the data produced by mass spectrometry equipment, with emphasis on matching the measured sequence to a database that allows accurate identification of each protein. Over the past 16 years, several data analysis programs have been developed for protein identification (Xu and Ma 2006). Typical commercially available software includes: Sequest (Eng, McCormack et al. 1994; MacCoss, Wu et al. 2002; Sachs, Wiener et al. 2005), Mascot (Matrix Science) (Perkins, Pappin et al. 1999), and Peaks (Ma, Zhang et al. 2003). Many additional programs have been developed, typically using different scoring functions, and different methods for error correction and interpretation of the MS/MS results.

However, despite the many alternative mass spectrometry software programs that are available, these programs exhibit serious deficiencies that prevent accurate measurement and detailed analyses. Based on the identified deficiencies in the prior methods, there is a basic need for a novel method to meet the requirements. The novel method must correct errors in the mass spectrometry measurements, correct errors in the software used by the MS/MS equipment, and screen out results that fail to meet accuracy criteria. The novel software must allow accurate measurement of protein functions and the full range of multiple protein-protein interactions.

With prior methods, test results were typically inconclusive, due to the inherent complexity in the identification of major biological trends and errors in the measurement of relative protein abundance.

The Need For New Mass Spectrometry Data Analysis Methods

Significantly, the prior methods and software exhibit serious unresolved issues, such as multiple identification errors, counting errors, and even basic mathematical errors in the original software code. Importantly, the existing methods do not produce the details necessary to derive protein function, or to measure statistical correlations that allow inference of protein-protein interaction for each protein pair in the sample. Accordingly, the existing methods fail to provide the information required for important protein analysis decisions, such as the formulation and design of new compounds for diagnosis or treatment of disease.

Thorough use of mass spectrometry equipment and related methods has resulted in clear identification of the requirements for new methods. Accurate measurement and complete disclosure of all measured data are necessary. The measurements must allow precise and unambiguous identification of each protein from the peptide samples. As a practical necessity, the new methods must be able to use the data output of existing software and testing methods, so that each existing laboratory is not required to purchase new equipment or to learn to use entirely different methods. The transition from the old methods to the new method should be without severe obstacles.

There is a need for an enhanced spectrum of information. Mere identification and classification are not sufficient. The data must support calculation of protein functions and protein interactions over time or following controlled changes in conditions. Detailed measurements of protein activity are necessary to describe and understand cellular signaling, deficiencies in the immune system, and the effects of modulation of signaling through inhibitors. This detailed information as to protein function is necessary to design effective treatments of critical diseases, such as cancer or sepsis.

SUMMARY OF THE INVENTION

Embodiments of the invention described herein provide a method of analysis and translation of data that represents chemical structures such as proteins, and fragments thereof such as peptides, into information that can be used for critical decisions, based on the measured activity and function of specific proteins and the complex interaction of an entire group of proteins. In the preferred embodiment, the method would provide a firm foundation for decisions regarding the effects of attempted modulation of protein activity, such as the functional effects of an inhibitor on kinase activity.

The invention is a novel calculation method that allows the existing data output of the mass spectrometry equipment to be processed in a unique manner so that the resulting information output is accurate and reliable. The invention provides accurate identification of each protein from the peptide segment, based on a detailed review of specific criteria to assure that errors from the mass spectrometry equipment and associated software are rejected. Also, the invention calculates and displays the amount of interaction between all pairs of genes and keyword categories in the sample. Based on this invention, accurate information is provided for critical issues, such as accurate measurement of protein functions and protein network signaling.

Benefits and Advantages of the New Method

Significantly, the new method provides a practical, feasible solution to unsolved critical issues. The new method provides multiple screening of errors, to provide refined data that allows accurate identification of each and every protein. The new method allows identification of the measured intensity of the protein-protein interaction for each and every pair of proteins. In addition, the novel method displays the results in a manner that converts the complexity into a meaningful pattern, so that the investigator can understand the results.

The new methods allow analyses that clearly show the extent of protein activity within a sample, and that clearly detect and describe the correlation of activity for each pair of proteins. The novel methods correct the various measurement and counting errors that are now included in the data output from current mass spectrometry equipment, due to the outdated software. The new methods solve the critical problems of verification of protein identity, measurement of abundance and activity for each identified protein, and the measured amount of interaction between all possible combinations of any two proteins. The novel methods provide a valuable solution to these previously unresolved issues.

The invention includes multiple screening and correction of calculation errors inherent in the data output from the mass spectrometry equipment. As a unique benefit, the new method provides the data in a format that allows pattern recognition to support understanding of the test results. The complexity is distilled into a structured format that allows reasoned conclusions, supported by statistical tests of confidence in the data presented.

The invention provides a method to measure the correlation and co-activity of all combinations and permutations for any two proteins within the sample. Accordingly, the method allows accurate measurement and comprehensive description of effects of inhibitors on protein activity. Significantly, based on separate measurements over time, the functional effect of an inhibitor on protein activity can be derived, based on plots of data that describe incremental changes over time, and the resulting mathematical formulas that show the underlying relationships. Based on accurate measurements, the invention supports meaningful tests of the effects over time of compounds that modulate protein activity, protein response to changes to communication network signaling, and the selection of chemical probes, candidate compounds, or molecular targets.

Pattern recognition is a practical requirement for complex data sets, which may have inconsistent details, or conflicting criteria, or subtle relationships. This pattern recognition is a practical necessity because the amount of data involved in proteomics is so voluminous that the relevant data cannot be reasonably viewed and understood by a human within the time required for decisions and actions.

Specialized software is a practical necessity to sort data, compare measurements with known information, derive relationships, and display the data results in a manner that allows recognition of the fundamental patterns by a human observer. Based on this pattern recognition, the human can make decisions and take actions based on facts, as contrasted to conflicting opinions or a welter of disorganized information. Proteomics involves such a vast volume of data, relationships, and contingent effects that a human expert cannot easily understand the underlying patterns without a clear display of the complex data. Significantly, this invention displays the underlying pattern from a large database with complex test data in a manner that allows prompt understanding and decision. This invention produces outputs in the form of a tabular listing and graphical displays which can be readily and quickly understood by a human.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram that depicts data processing procedures based on the invention which corrects prior errors from equipment and software and provides accurate output data for detailed analysis. The invention provides multiple screens with specific criteria for rejection of erroneous data, so that the resulting data is accurate and reliable. The data results describe protein activity, and protein-protein interaction for each possible pair of proteins. Based on iterative runs, the method allows description of network configurations, and the functional relationships of protein function over time.

FIG. 2 is a flow diagram that depicts the establishment of a control, as technical replicates with reverse isotopic labeling, by tagging with light and heavy molecular characteristics, and with separate calculation of the results for the two conditions, to establish a foundation for comparison of the control versus the LPS treated samples.

FIG. 3 is a flow diagram that depicts the major steps for the inventive method, including screening with 6 separate user-defined filters, screening for statistical significance, and grouping of results by gene and keyword category. The results include detailed reports of passed versus rejected data, and the amount of protein-protein interaction for each possible combination and permutation of protein pairs.

FIG. 4 is a flow diagram that depicts the threshold values for statistics and heuristics. Statistical inference tests whether the results occurred due to random events. Heuristics tests whether the results are close to an approximate solution by an alternative method. For either approach, the results are verified by comparison to replicates.

FIG. 5 is a table that depicts specific rejection criteria for four of the six independent screens which act to accept valid data and to reject erroneous data, based on criteria established by the investigator with regard to basic biochemical relationships. The post-processing filters control the quality of data that is used to derive the calculated results. For this test, over half of the 15,925 peptide sequences (PS) that were accepted by standard methods were rejected due to the measurement and data processing errors identified by the inventive methods.

FIG. 6A is a chart that depicts the variation from a normal distribution due to skewness. FIG. 6B shows the variation from a normal distribution due to kurtosis. In black, the frequency distribution is shown for heavy-labeled vs. light-labeled peptide ion areas. In grey, the frequency distribution is shown for heavy-labeled vs. total paired peptide ion areas. The frequency distribution is calculated from log2-transformed fold change ratios.
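As a non-limiting illustration of the quantities named for FIG. 6A and FIG. 6B, the following sketch computes a log2-transformed fold change ratio from a pair of ion areas, together with the skewness and excess kurtosis of a set of such ratios. The sketch assumes population (biased) moment estimators, which is one common convention and is not stated in the text above; the class and method names (DistributionShape, log2Ratio, skewness, excessKurtosis) are hypothetical.

```java
// Illustrative sketch only; the estimators used by the appendix code are not stated here.
// Assumption: population (biased) central moments are used for skewness and kurtosis.
final class DistributionShape {
    private DistributionShape() {}

    /** log2 of the heavy-to-light ion area ratio; both areas must be positive. */
    static double log2Ratio(double heavyArea, double lightArea) {
        if (heavyArea <= 0.0 || lightArea <= 0.0) {
            throw new IllegalArgumentException("ion areas must be positive");
        }
        return Math.log(heavyArea / lightArea) / Math.log(2.0);
    }

    /** k-th central moment, (1/n) * sum of (v - mean)^k. */
    static double centralMoment(double[] v, int k) {
        double mean = 0.0;
        for (double x : v) {
            mean += x;
        }
        mean /= v.length;
        double sum = 0.0;
        for (double x : v) {
            sum += Math.pow(x - mean, k);
        }
        return sum / v.length;
    }

    /** Skewness m3 / m2^(3/2); zero for a symmetric sample. */
    static double skewness(double[] v) {
        double m2 = centralMoment(v, 2);
        return centralMoment(v, 3) / Math.pow(m2, 1.5);
    }

    /** Excess kurtosis m4 / m2^2 - 3; zero for a normal distribution. */
    static double excessKurtosis(double[] v) {
        double m2 = centralMoment(v, 2);
        return centralMoment(v, 4) / (m2 * m2) - 3.0;
    }
}
```

Under these assumptions, a set of ratios whose skewness or excess kurtosis departs markedly from zero would be the kind of distribution that the figures describe as varying from normal.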

FIG. 7 is a table that depicts the large proportion of erroneous results based on standard methods, and the effect of the statistical significance filters provided by the invention. For Experiment 1, from 4,796 peptides, although 720 genes were recognized by the standard methods, after rejecting insignificant data only 29 genes remained. Similarly, for Experiment 2, analysis using standard methods for 5,046 peptides resulted in 749 genes recognized, but only 25 genes passed the statistical filters. A similar proportion of errors is recognized when screened by keyword categories.

FIG. 8 is a chart that depicts a significant change in abundance in response to lipopolysaccharide (LPS), as compared to the control group, showing protein signaling response to gram negative bacteria. Experiment 1 is shown in black and Experiment 2 is shown in grey. Within Category 1 are the genes that changed more than one standard deviation from the total peptide population mean. Within Category 2 are the few genes with statistically significant means from the peptide population which passed normality constraints. Within Category 3 are results that are rejected because calculated results did not agree with experimental replicates.

FIG. 9 is a chart that depicts results similar to FIG. 8, but with measurements grouped by keyword category instead of by gene. Within Category 1 are the keywords that changed more than one standard deviation from the total peptide population mean. Within Category 2 are the keywords with statistically significant means from the peptide population which passed normality constraints. Within Category 3 are results that are rejected because calculated results did not agree with experimental replicates.
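The Category 1 criterion of FIG. 8 and FIG. 9 may be illustrated, under stated assumptions, by the following sketch. It assumes that each peptide carries a log2 fold change ratio and a grouping label (a gene or a keyword category), that a group "changed" when its mean ratio lies more than one sample standard deviation from the mean of the whole peptide population, and that the sample (n - 1) standard deviation is used. None of these details is specified in the figure descriptions, and the names (OneSigmaScreen, flag) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; the exact Category 1 computation is an assumption.
final class OneSigmaScreen {
    private OneSigmaScreen() {}

    /**
     * Returns the labels whose mean log2 ratio differs from the population mean
     * by more than one population-wide sample standard deviation.
     * group[i] is the gene or keyword label of peptide i.
     */
    static Set<String> flag(String[] group, double[] log2Ratio) {
        int n = log2Ratio.length;
        if (group.length != n || n < 2) {
            throw new IllegalArgumentException("need matching arrays with at least 2 peptides");
        }
        // Mean and standard deviation over the total peptide population.
        double mean = 0.0;
        for (double r : log2Ratio) {
            mean += r;
        }
        mean /= n;
        double ss = 0.0;
        for (double r : log2Ratio) {
            ss += (r - mean) * (r - mean);
        }
        double sd = Math.sqrt(ss / (n - 1));

        // Accumulate {sum, count} per label.
        Map<String, double[]> acc = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            double[] a = acc.computeIfAbsent(group[i], k -> new double[2]);
            a[0] += log2Ratio[i];
            a[1] += 1.0;
        }

        Set<String> flagged = new TreeSet<>();
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            double groupMean = e.getValue()[0] / e.getValue()[1];
            if (Math.abs(groupMean - mean) > sd) {
                flagged.add(e.getKey());
            }
        }
        return flagged;
    }
}
```

Because the label array is the only grouping input, the same routine would serve for grouping by gene and for grouping by keyword category.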

FIG. 10 is a chart that depicts protein-protein interaction as shown by keyword categories for protein function. The number score for each pair depicts the amount of interaction, from 0 to 1.0.

FIG. 11 is a chart that depicts the protein-protein interaction for specific genes with Akt inhibition of lipopolysaccharide (LPS), for both light and heavy isotope-labeled peptides.

FIG. 12 is a chart that depicts the protein-protein interaction for specific keyword categories with Akt inhibition of lipopolysaccharide (LPS), for both light and heavy isotope-labeled peptides. The keyword grouping allows improved insight into the protein functions, as compared to grouping by genes.

DETAILED DESCRIPTION OF THE INVENTION

Discussion of Proteomics

Proteomics is the study of protein functions. By contrast, genomics is the study of gene functions. The complete set of proteins expressed by a cellular genome under specified conditions is popularly referred to as the proteome. Over the past decade, there has been a significant effort to comprehensively describe differences between cellular states by changes in the proteome. Differences in protein expression and modification have been used to investigate the pathology of disease processes (Hanash 2003) and highlight differences in gene regulation (Mootha, Bunkenborg et al. 2003). Technological advances allow increasingly efficient parallel quantitative analyses of protein abundance and modifications (Aebersold and Mann 2003).

The proteome is a more dynamic counterpart to the genome and proteomic experiments can generate a staggering amount of data despite incomplete sampling of all cellular proteins (Cox and Mann 2007). Nevertheless, the promise of proteome research as a tool for the identification of characteristic patterns of protein activity is critically hindered by the challenge to analyze and interpret proteomic data in a biological context. Computational approaches based on the compilation of observed molecular interactions have been developed to reference and build probable networks of protein activity (Kitano 2002; Joyce and Palsson 2006; Ekins, Mestres et al. 2007; Sharan, Ulitsky et al. 2007).

Despite the prior advances, the amount of time required to generate biologically reasonable hypotheses from proteomic data can be a significant challenge for data analysis. Techniques to systematically analyze protein networks will not be successful without a standardized method of data analysis that determines quantitative significance at a global scale and meaningfully organizes data using graphical aids for the visualization of patterns.

Mass spectrometry is one method of proteome investigation for which there are several mature software tools for the identification of peptide sequences, validation of these identifications, and quantitation from raw spectral data. However, the unsolved issues include fundamental problems of interpreting and analyzing the resulting data (Aebersold and Mann 2003). The basic problems stem from outdated software and methods that did not keep pace with the challenges of proteomics.

As a fundamental requirement, the new method must accept the widely encountered formats of proteomic data and produce commonly accepted statistics to guide researchers to the genes of greatest interest. The application transparently incorporates heuristics that have been tested on extensive experimental research, so that the researcher has effective control over a large number of parameters. By contrast, existing mass spectrometry analytical software forces the user to accept hard-coded parameters buried within the code.

As a unique advantage, this new approach provides the end user with a set of sieves for finding a range of interesting values instead of a drill for mining out a single value. One major part of this approach is the capacity for iterative analysis, to allow multiple analyses of the data using different criteria. With this new method, the researcher can efficiently analyze a batch of data to produce a specific set of results, then modify the search parameters to generate another set of results.

This rapid iterative approach allows insight into the functional relationships, as contrasted to the slow process of detailed inspection of individual data points without getting any sense of the overall relationships. The novel methods allow the researcher to quickly discern the structure within the complexity. Ultimately the goal of many proteomic experiments is to provide a snapshot of groups of proteins involved in functional roles (Gavin, Bosche et al. 2002). This functional categorization of proteomic data is dependent on the availability of complete genome sequences and searchable databases with entry definitions.

While popular databases, such as Gene Ontology and UniProt, have defined sets of keywords for organizing genes, existing software will often use these solely for qualitative groupings and will not take full advantage of these features for quantitative analyses. Our application can use the keyword as a unit of statistical analysis, which provides larger sample sizes and a more global view of the data. The application can also explore relations between keywords, which has implications for network analysis. Moreover, this data may contain many false correlations.

As a basic requirement, the new method must use simple heuristics to filter and sort data, meaningfully organize the remaining data, and apply intuitive statistical methods to highlight genes that showed significant changes. Several suites of software have been developed to work with raw data. This program does not deal with raw data. Instead, it organizes data so that the user can interpret it. This requires filters to remove data of poor quality and sorting for easy comparisons at the level of peptides, genes, and experiments. It also requires statistics and heuristics so that decisions can be made according to a uniform application of values. The data are defined by a peptide, by a gene, or by a population of genes. Multiple variables are addressed systematically in a specified order.

A longstanding goal in biology has been to identify proteins that are indicative of inflammatory signaling events pertaining to the response of neutrophils to lipopolysaccharide. To accomplish this, we needed a post-processing platform that was capable of calculating relative protein abundance and mapping identified proteins according to function. Several software suites are available to generate graphics from protein annotation notes derived from database searches. Existing software does not take into account functional overlap between categories. The software was designed for mass spectrometry but is applicable in theory to all proteomic analyses.

EXPERIMENTAL PROCEDURES

The following described procedures are the preferred embodiment for use of the invention. Improved procedures may be developed, and the use of the improved procedures does not limit the scope or applications of the invention.

Sample Preparation

To provide a source of data that was rich in content for functional annotation and had been extensively analyzed through various methods in the past, we prepared a complex soluble protein sample from neutrophils activated by LPS. To normalize the conditions under which samples were prepared, the human promyelocytic HL-60 cell line (ATCC) was differentiated in culture using 1 μM ATRA, 6 pM 1α,25-Dihydroxyvitamin D3, and 30 ng/mL G-CSF in IMDM, supplemented with 20% FBS and 4 mM L-glutamine. Cells were activated via treatment with 100 ng/mL of lipopolysaccharide (LPS) from E. coli O111:B4 (List Biological Laboratories, Campbell, Calif.). The control sample was treated with an equal volume of double-distilled and autoclaved water. Cells were harvested, lysed, and enriched for phosphorylated proteins using the Pro-Q Diamond phospho-enrichment kit (Invitrogen, Carlsbad, Calif.), following kit instructions and as previously described (Kristjansdottir, Wolfgeher et al. 2008). Fractions were collected, concentrated, and washed with 0.25% CHAPS in 25 mM Tris, pH 7.5, by centrifugation at 4° C. using 10 kDa-cutoff concentrators (Millipore, Billerica, Mass.) for a final volume near 500 μL.

The total protein content of eluted fractions was determined by Bradford analysis (Pierce, Rockford, Ill.) using the average of triplicates. Total protein content was also qualitatively compared by the intensity of Coomassie staining (Thermo Scientific, Rockford, Ill.) following gel electrophoresis. LPS-treated and control samples were loaded at equal total protein content for separation on 4-12% NuPAGE gradient electrophoresis gels (Invitrogen) using MOPS SDS running buffer. Gels were cut into 11 vertical slices, combining 9 replicate lanes for each vertical slice to increase protein abundance per sample.

Gel slices were de-stained with 50% CH3CN in 100 mM NH4HCO3, pH 7.5, for 15 min, reduced with 20 mM TCEP in 50 mM NH4HCO3 for 30 min at 37° C., and alkylated with 50 mM iodoacetamide for 30 min in the dark. Gel slices were washed with ultrapure water and dehydrated in CH3CN for 5-10 min, which was removed by vacuum centrifugation. Proteins were digested in-gel by re-hydrating each gel slice with 2 μg of trypsin in 60 mM NH4HCO3 with 0.5 mM CaCl2 for 12 h at 37° C. Peptides were extracted from gel slices in two steps, starting with an aqueous extraction with 5% formic acid in water for 1 h and followed with an organic extraction with 5% formic acid in 50% CH3CN. Extractions from each step were centrifuged under vacuum separately, combined in water, and lyophilized.

Isotopic Labeling

Isotopic labeling by enzymatic incorporation of 16O and 18O was used for relative protein quantitation between LPS-treated and control samples. To label peptides at the carboxyl-terminus with 16O or 18O, samples were re-suspended in H216O or H218O and incubated with 30 μL of washed Mag-Trypsin beads (Clontech, Mountain View, Calif.) for 48 h at 37° C. The reaction was monitored by MALDI-TOF MS (4700 Voyager, Applied Biosystems, Foster City, Calif.). Beads were removed by magnetic separation, and labeled samples were lyophilized, re-suspended in 2% CH3CN with 0.2% formic acid in water (mobile phase A), and mixed 1:1 (v/v).

Nanoscale LC-MS/MS

A total of 11 LC-MS/MS runs were performed per experiment, corresponding to the number of gel slices. Using an Eksigent AS1 autosampler and auxiliary isocratic pump (Eksigent Technologies, Livermore, Calif.), 10 μL injections were loaded at 10 μL/min onto a 2.5-μL Opti-Pak precolumn (Optimize Technologies, Oregon City, Oreg.) packed with 5 μm, 200 Å Michrom Magic C8 solid phase (Michrom BioResources, Inc., Auburn, Calif.) to remove contaminating salts. Peptides were separated at 350 nL/min on a 20-cm×75-μm-inner diameter column packed with 5 μm, 200 Å Michrom Magic C18 solid phase (Michrom BioResources). A 90 min two-step chromatographic gradient was used that started with a slow separation from 5-50% B over 60 min followed by a rapid increase from 50-95% B over 10 min using 80% CH3CN, 10% n-propyl alcohol, and 0.2% formic acid in water as mobile phase B.

Samples were analyzed on an LTQ-Orbitrap Hybrid FT mass spectrometer (Thermo Scientific). Data were collected in full profile mode from m/z 375 to 1,950 at 60,000 resolving power with internal calibrant lock masses. The five most abundant double- and triple-charged precursors with a minimum signal of 8,000 between 375-1,600 m/z were subjected to collision-induced dissociation (CID) with 35% normalized collision energy, 30 ms activation time, and activation Q at 0.25. To reduce repeat analyses, dynamic exclusions were established for 60 s with an isolation width of 1.6 m/z units, for low and high mass exclusion of 0.8 m/z units each per precursor.

Database Searching

Thermo .raw files were converted to the mzXML format using ReAdW (from TPP version 4.1) and imported into the CPAS database organization and analysis application (version 9.10) (Rauch, Bellew et al. 2006). X!Tandem (version 2.007.01.01.1) (Craig and Beavis 2003) identified peptides and proteins from fragment ion spectra of selected precursors using the non-redundant human international protein index (version 3.53) maintained at the European Bioinformatics Institute (EBI; Hinxton, United Kingdom). Parent ions required less than 20 ppm mass accuracy and greater than 90% matched molecular weight against expected values based on the PeptideProphet algorithm (Keller, Nesvizhskii et al. 2002).

Search parameters specified tryptic digestion and allowed only one missed cleavage per peptide. Cysteine alkylation from iodoacetamide treatment was set as a fixed modification. S-carbamoylmethylcysteine cyclization at the amino-terminus, pyroglutamic acid formation from glutamine and glutamate, oxidation of methionine, and single and double isotope label incorporation at lysine and arginine were considered variable modifications. Although distinct proteins within a family may share identical peptides, ambiguous assignments were grouped by a single protein identifier based on a representative group member following the law of parsimony.

Ion Current Integration

The XPRESS software (version 2.1, from TPP version 3.4) was used within CPAS to reconstruct peptide elution profiles (Han, Han et al. 2001). Peptide signal intensity was integrated over the number of MS scans in which a peptide was observed, thereby providing quantitative areas for 16O and 18O labeled peptides. XPRESS was not used to calculate protein abundance ratios from these areas.

Software Setup

The software described here interacted with a MySQL database that was populated with reference data used to filter and organize results. Keywords were defined by the total set of 32,378 terms in 13 categories from the Universal Protein Resource (UniProt) and Gene Ontology catalogs. The complete human repository of proteins from the UniProt knowledgebase, including protein-specific accession numbers, molecular weight information, and keyword associations, was loaded into the MySQL database.

Software Implementation

The software was written in Java to facilitate platform independence. Reference and experimental data were stored in a MySQL database. The Apache POI library was used to read and write Excel files. The Apache Commons Math library was used as a standard resource to compare implemented statistical calculations and to calculate p-values from t-statistics. The standard analytical software R was also used to compare and validate implemented statistical calculations and for cluster analysis. The software was run on a standard desktop computer running Linux or Mac OS X. An auxiliary program incorporating the Python Imaging Library was used to generate heat map images. Prism (version 4.0a) was used to calculate frequency distributions for the visualization of trends following data analysis, and Microsoft Excel was used to produce graphs.

Computation and Application Control Flow

Using a GUI interface, experimental data were integrated into the MySQL database after export from CPAS as Microsoft Excel files. Excel files combined MS runs from a single experiment and contained all of the information available from CPAS analyses, including columns for peptide sequence, gene name, MS run/fraction name, PeptideProphet score, protein accession number, scan number, retention time, and quantitative analysis fields. Further descriptions of these fields are available in the CPAS documentation.

Fractions named in CPAS-derived Excel files corresponded to MS runs from each excised gel slice per experiment. These fractions were defined in units of molecular weight by user input at the GUI interface. To simplify the task of user input, the software searched each Excel file and identified all fractions. The user provided the approximate maximum and minimum molecular weight restrictions for proteins identified in each MS run according to the boundaries of the gel slice from which the proteins were extracted during sample preparation.

The software queried experimental data against the human protein repository downloaded from UniProt and keyword categories defined by both UniProt and Gene Ontology catalogs. Six optional filters with user-defined parameters controlled the data that were used for quantitative analysis (Table 1). For each query, three reports were generated in Excel or CSV formats: a Details Report, a Keyword Overlap Report, and a Keyword Overlap Table (examples are available in Supplementary Data). A large number of display options are available so that reports can be generated for readable summaries or detailed analyses.

The Details Report displays filtered peptide sequence data organized by gene, keyword, and experiment. Descriptive, normality, and t-statistics were calculated for the population of peptide sequences defined by each gene, keyword, or experiment. To simplify troubleshooting, the Details Report included peptide sequences that failed to pass each filter, along with the parameters of the filter used to generate the report.

The Keyword Overlap Report compares genes within each keyword category and reports an overlap score based on commonality between all pairs of categories. Display options for the Keyword Overlap Report include an explicit list of the genes shared and excluded from each keyword pair and a reiteration of gene and keyword statistics from the Details Report.

The Keyword Overlap Table displays the overlap scores calculated for the Keyword Overlap Report in tabular form. Each row and column is a keyword and each table entry is the overlap score for that pair. The tabular format is analogous to mileage charts between cities in a road atlas and provides an easy visual aid for distinguishing functional protein groupings. A user-defined threshold was provided to reduce the display of overlap scores between keywords to those above a given value.

The control flow for the generation of reports is summarized in FIG. 1. The Gene Section of the Details Report is generated first, beginning with a SQL query for all of the peptide sequences within an experiment. Five optional filters are embedded in the first SQL query, and peptides that pass every filter are listed and organized by gene name. Peptides that fail each filter are listed in separate indexed sections. In cases where identical peptide sequences are identified within CPAS more than once, only data with the highest PeptideProphet score are retained.
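The rule for duplicate peptide sequences can be sketched in a few lines. The following Python fragment is only an illustration and is not the Java implementation described above; the record layout, the field names (`sequence`, `peptide_prophet`, `gene`), and the example values are hypothetical.

```python
def keep_best_scoring(records):
    """Keep one record per peptide sequence: the highest-scoring one.

    `records` is a list of dicts with hypothetical keys
    'sequence' and 'peptide_prophet'.
    """
    best = {}
    for rec in records:
        seq = rec["sequence"]
        # Replace the stored record only when the new score is higher.
        if seq not in best or rec["peptide_prophet"] > best[seq]["peptide_prophet"]:
            best[seq] = rec
    return list(best.values())


# Illustrative input: the first sequence was identified twice.
records = [
    {"sequence": "PEPTIDEK", "peptide_prophet": 0.91, "gene": "GENE1"},
    {"sequence": "PEPTIDEK", "peptide_prophet": 0.99, "gene": "GENE1"},
    {"sequence": "SEQUENCER", "peptide_prophet": 0.95, "gene": "GENE2"},
]
unique = keep_best_scoring(records)
```

In the described software this selection happens alongside the SQL query; doing it in application code, as here, is simply the easiest way to show the rule.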

For all of the peptides that pass the first five filters, fold change ratios are calculated per peptide sequence using the integrated areas of heavy and light parent ion currents. The sixth filter confirms that the number of peptide sequences that pass the first five filters is above a user-defined threshold for each gene.

Peptide sequences that pass all six filters make up the total population of peptides at the Experiment level. These peptides are used to calculate the first round of statistics, including descriptive (mean, median, standard deviation, and standard error) and normality statistics (skewness, kurtosis, and D'Agostino Omnibus K2 statistic and p-value). The sample size, defined as the number of peptide sequences used to calculate statistics, is displayed for each Experiment, Keyword, and Gene. User input determines whether the Experiment-level median is used to normalize and/or log2 transform all fold-change ratios at the Gene level. If either of these transformations is applied, Experiment-level statistics followed by Gene-level statistics are recalculated.
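The optional normalization and log2 transformation can be illustrated with a short Python sketch. It assumes that a fold-change ratio is the heavy area divided by the light area (the text does not state the direction) and that normalization means dividing each ratio by the Experiment-level median; the function names and example values are hypothetical.

```python
import math
import statistics


def fold_change(heavy_area, light_area):
    """Fold-change ratio for one peptide pair (assumed heavy over light)."""
    return heavy_area / light_area


def transform_ratios(ratios, normalize=True, log2=True):
    """Optionally divide by the experiment median, then optionally take log2."""
    median = statistics.median(ratios)
    transformed = []
    for r in ratios:
        if normalize:
            r = r / median
        if log2:
            r = math.log2(r)
        transformed.append(r)
    return transformed


# Illustrative ratios; their median is 1.0.
ratios = [0.5, 1.0, 2.0, 4.0, 1.0]
log_ratios = transform_ratios(ratios)
```

After either transformation, the descriptive statistics would be recomputed on the transformed values, which is why the text describes a second round of calculations.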

In addition to descriptive and normality statistics, the software calculates significance statistics for each gene. The software reports t-statistics and p-values resulting from one-sample t-tests using both the Experiment-level mean and median as the value against which the mean of each Gene is tested. In addition to this formal measure of significance, the software reports the number of Experiment-level standard deviations, or standard errors, that the mean of each Gene is distanced from the Experiment-level mean or median. Although every statistic is calculated, the user may specify which, if any, are displayed in the final report.
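As a rough illustration of the Gene-level significance statistics, the Python sketch below computes a one-sample t-statistic and the distance of a Gene mean from a reference value in units of the Experiment-level standard deviation. The function name, argument names, and example values are hypothetical, and the p-value (obtained from the t-distribution in the described software) is omitted to keep the example dependency-free.

```python
import math
import statistics


def gene_significance(gene_ratios, reference, experiment_sd):
    """Return (t statistic, degrees of freedom, distance in experiment SDs).

    `reference` is the Experiment-level mean or median that the Gene
    mean is tested against.
    """
    n = len(gene_ratios)
    mean = statistics.mean(gene_ratios)
    sd = statistics.stdev(gene_ratios)  # sample SD (n - 1 denominator)
    t_stat = (mean - reference) / (sd / math.sqrt(n))
    sd_distance = (mean - reference) / experiment_sd
    return t_stat, n - 1, sd_distance


# Illustrative log2 ratios for one gene, tested against a reference of 0.
t_stat, dof, sd_distance = gene_significance([1.0, 2.0, 3.0], 0.0, 2.0)
```

The same function can be called twice per gene, once with the Experiment-level mean and once with the median, to mirror the two tests described above.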

While the failures for the peptide sequence count cutoff filter come from the program organizing the pool of peptide sequences by gene, the failure lists for the other five filters come directly from the database. There is a separate query for each filter for the set of peptide sequences from the experiment that fails a given filter's test. If there are duplicate entries for a peptide sequence in the resulting set, the one with the highest PeptideProphet score is retained. The complete set of failures forms the Failure Section of the Details Report.

Peptide sequences that pass all six optional filters are organized at the Experiment level, the Gene level, and the Keyword level. The organization of results by keyword is determined by the 13 keyword category constraints specified by the user at the GUI interface. Statistics are calculated for each keyword term that applies to each Gene out of the total 32,378 terms available. Descriptive, normality, and significance statistics are calculated from the population of peptides that are grouped within each term.

Generally, every Gene is associated with more than one keyword term. Statistics for each term are calculated from the population of peptides that are grouped within each keyword. The degree of overlap between keywords is calculated by the number of genes that are shared between terms, as shown by the following formula:

$\frac{G_{ab}}{G_{a} + G_{b} + G_{ab}}$

The terms in the above formula are defined as: $G_{ab}$ is the number of genes shared between keywords a and b, $G_{a}$ is the number of genes associated with keyword a, and $G_{b}$ is the number of genes associated with keyword b. The resulting score is a number between 0 and 1, where 1 represents complete overlap.
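A small worked sketch of the overlap score follows. It rests on one interpretive assumption that should be checked against the actual implementation: for complete overlap to score 1, the counts for keyword a and keyword b in the denominator are taken here as the genes found under that keyword only, with shared genes counted once in the shared term. If those counts instead included the shared genes, two identical gene sets would score 1/3 rather than 1. The function name and gene labels are hypothetical.

```python
def overlap_score(genes_a, genes_b):
    """Overlap score between two keywords, given their gene sets.

    Assumes the per-keyword counts in the denominator are the genes
    exclusive to each keyword, so identical sets give a score of 1.
    """
    a, b = set(genes_a), set(genes_b)
    shared = len(a & b)   # genes under both keywords
    only_a = len(a - b)   # genes under keyword a only
    only_b = len(b - a)   # genes under keyword b only
    total = only_a + only_b + shared
    return shared / total if total else 0.0


identical = overlap_score({"G1", "G2"}, {"G1", "G2"})
partial = overlap_score({"G1", "G2", "G3"}, {"G3", "G4"})
disjoint = overlap_score({"G1"}, {"G2"})
```

Under this reading, the partial example shares one gene out of four distinct genes and therefore scores 0.25.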

The Keyword Overlap Report lists all Genes that are associated with each Keyword and Genes that are shared between Keywords. To assist with analysis, the user has the option to display statistics for each Gene within each Keyword term. The Keyword Overlap Table converts a flat list of overlap scores for every pair of keywords to a matrix display.
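The conversion from a flat list of pairwise scores to a matrix can be sketched as follows. This Python fragment is illustrative only: the input format (a dict keyed by keyword pairs), the use of `None` for hidden cells, and the keyword names are assumptions, and the display threshold is applied as "strictly above" to match the wording used for the Keyword Overlap Table.

```python
def overlap_table(scores, threshold=0.0):
    """Build a symmetric keyword-by-keyword table from pairwise scores.

    `scores` maps (keyword_a, keyword_b) tuples to an overlap score.
    Cells at or below `threshold` are left as None (not displayed).
    """
    keywords = sorted({k for pair in scores for k in pair})
    table = {row: {col: None for col in keywords} for row in keywords}
    for (a, b), score in scores.items():
        if score > threshold:
            table[a][b] = score
            table[b][a] = score  # mirror the entry, as in a mileage chart
    return keywords, table


# Hypothetical keyword pairs and scores.
scores = {("Kinase", "Transport"): 0.25, ("Kinase", "Apoptosis"): 0.8}
keywords, table = overlap_table(scores, threshold=0.5)
```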

Results

The features and advantages of the present invention should be apparent from the following description of the preferred embodiment, which illustrates, by way of example, the principles and details of typical test procedures that were used to produce specific results.

Generation of Data With Reverse Isotopic Labeling

FIG. 2 illustrates the series of steps to generate data for LC-MS/MS analysis. To control the standardization of experimental variables, HL60 cells were differentiated in culture and split into two groups prior to treatment with LPS. The control group and LPS-treated group were lysed and enriched for phospho-protein complexes on separate affinity columns. The eluate from each column was loaded with equal total protein content and separated by gel electrophoresis. The equal loading of each gel was important to ensure accurate relative ratios between samples for quantitative analysis.

Fractions determined by molecular weight boundaries in the gel were excised, digested, and labeled with 16O or 18O at peptide COOH-termini using trypsin. Although both protein digestion and peptide labeling were carried out by trypsin, the two processes were performed in series to ensure completion of each reaction. Differentially labeled samples were combined in equal volumes and submitted for LC-MS/MS analysis. The control and LPS-treated samples each provided two reverse-labeled peptide populations that were combined to form pairs from opposite labeling. This reverse-labeling strategy was intended to provide validation for peptide quantitation, independent from any bias in labeling efficiency at different substrate sequences.

Data Filtering Strategy

The CPAS platform was used to identify peptides from fragmentation spectra and calculate average parent ion intensities over the total number of scans in which they were observed. Data exported from CPAS contained over 15,000 peptide identifications per experiment. Initial inspection revealed false pair associations and unreasonably large ratios between heavy and light peptide pairs. For example, search parameters had not been modified to differentiate between terminal lysine residues that would be modified by enzymatic transfer of 18O and internal lysine residues that would not be modified.

Therefore, peptide pairs were assigned with differences of 4, 8, and 12 Da, although a difference of only 2 or 4 Da between heavy and light pairs was experimentally feasible. In addition, ratios were calculated from paired peptide elution areas that exceeded 1:100. Manual inspection of the .raw data confirmed that these values were excessive and the result of false peptide pair associations. Therefore, we implemented a set of logical rules to remove unreliable peptide data that contributed to the inaccuracy of measurements of relative protein abundance.

We began by implementing five filters based on simple arithmetic calculations that are rooted in the basic model of peptide behavior during gel electrophoresis followed by liquid chromatography and mass spectrometry. All of the filters were developed to respond to parameters that are configurable by the user, and all of the filters are ultimately optional. Two filters take into account chromatographic characteristics of peptides: “scan cutoff” sets the minimum number of scans over which a peptide must be observed to ensure a reliable elution profile, and “Light-Heavy scan cutoff” limits the number of scans in which light and heavy isotope-labeled peptides are not both present, thereby requiring co-elution.

Two additional filters use heavy and light peptide pairs to ensure the quality of quantitative data: “delta mass cutoff” limits the difference in mass between heavy and light labeled peptide pairs, according to the isotopes that were used during sample preparation, and “ratio cutoff” limits the numeric value of the ratio between peptide elution profile areas. The last of the first five filters, “molecular weight cutoff,” imposes a limit on the percentage of error between the expected molecular weight of a protein from which a peptide was derived, as reported by UniProt, and the fraction of the gel from which the protein was excised. After the removal of peptide data that failed these filters, peptides were organized according to the Gene to which they were assigned. The sixth and final filter, “peptide sequence count cutoff,” removes the genes whose number of peptide sequences is less than or equal to the cutoff value.

The final grouping of genes and peptides that remained was used for quantitative analysis. The program output was designed to report all peptide sequences that did not pass each filter according to parameters specified by the user. Whether data passes or fails a filter, no data is hidden from the user. Therefore, it is possible to compare reports using different parameter values. The threshold for each filter was manipulated during the analysis of our datasets and the results are summarized in the following sections.

Filters 1 and 2: Chromatographic Elution Profile

The chromatographic elution profile of peptides provides two characteristics that can be used to enforce accurate quantitative analysis: the duration of elution and co-elution of peptide pairs. Exploratory studies have determined that increasing the smoothness of a peptide elution profile increases the accuracy of measurements of peptide abundance (Yang, Yang et al. 2010). Therefore, a threshold value defined by user input was established for the minimum number of total scans for each labeled peptide.

By requiring a minimum duration for which a peptide is observed in MS, we were able to filter out ions with very short and sporadic appearances. Maximizing the duration of peptide elution was used as a proxy for continuous peptide elution, an important characteristic in peptide chromatography and one that is used by many proteomic tools, including the Trans-Proteomic Pipeline (TPP) (Li, Zhang et al. 2003). Even very low limiting thresholds for the duration of peptide elution were successful at removing input from sporadic ions (Table 2). This had the effect of reducing false pairings of ion peaks and improving the quality of data that was included in the final analysis.

The co-elution of isotope-labeled heavy and light peptides by liquid chromatography is one confirmation that they share identical peptide sequences. Co-elution and subsequent analysis in a shared set of MS scans is also a requirement for the accurate comparison of peptide pair ion abundance. To act as a true internal reference that minimizes the influence of variability in ion intensity between MS scans, peptide pairs must be present in the same MS scan. While algorithms have successfully been used to identify ion pairs within unprocessed MS spectra (Volchenboum, Kristjansdottir et al. 2009), our technique verifies peptide co-elution at the level of post-processing. This approach is not computationally intensive and allows end-user control over filter parameters. A user-defined threshold limits the number of MS scans that are not shared between peptide pairs, thereby maximizing the duration of co-elution.
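The two chromatographic filters can be sketched as simple checks on scan ranges. The Python example below assumes that each labeled peptide's elution window is available as an inclusive (first scan, last scan) pair; the function names, parameter names, and scan numbers are hypothetical and serve only to illustrate the logic.

```python
def scan_count(window):
    """Number of scans in an inclusive (first_scan, last_scan) window."""
    first, last = window
    return last - first + 1


def passes_scan_cutoff(light, heavy, min_scans):
    """Filter 1: both labeled peptides must span at least min_scans scans."""
    return scan_count(light) >= min_scans and scan_count(heavy) >= min_scans


def unshared_scans(light, heavy):
    """Count scans in which only one member of the pair is present."""
    light_set = set(range(light[0], light[1] + 1))
    heavy_set = set(range(heavy[0], heavy[1] + 1))
    return len(light_set ^ heavy_set)  # symmetric difference


def passes_light_heavy_cutoff(light, heavy, max_unshared):
    """Filter 2: limit the scans not shared between light and heavy."""
    return unshared_scans(light, heavy) <= max_unshared


# Hypothetical elution windows offset by two scans at each end.
light_window, heavy_window = (100, 110), (102, 112)
```

If the upstream export reports identical start and end scans for both peptides in every pair, the second check always counts zero unshared scans and cannot discriminate between pairs.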

There are some disadvantages to using filters at the level of post-processing data analysis. Upstream software may not provide informative quantitative parameters and limit the effectiveness of downstream filters. For example, we found that data exported from CPAS displayed identical start and end scans for every heavy and light peptide pair. The result was that the filter used to remove peptide pairs with insufficient chromatographic overlap was not effective. Any threshold value input generated identical output. Inspection of the .raw files clearly showed different start and end scans for each peptide in every pair. This highlights the utility of increased transparency in processing software and presents a case for permissive and information-rich analyses during early processing steps followed by more stringent analyses based on user-defined parameters in later steps.

Filters 3 and 4: Relative Quantitative Analysis From Labeled Pairs

Peptide pairs are defined by the difference in mass between isotopically labeled samples. The data generated for these experiments were derived from sample sets labeled with either 16O or 18O at one or both sites of the peptide COOH-terminus. Heavy and light peptide pairs were therefore defined by a 2 or 4 Da difference in mass, depending on whether one or both oxygen atoms at the COOH-terminus were labeled. Incomplete labeling can lead to several challenges for accurate quantitation and requires the use of a specialized application for data processing (Mason, Therneau et al. 2007).

To test for incomplete labeling in these experiments, we imposed a threshold value of 2 Da for a filter designed to exclude incorrectly paired peptides from analysis. A maximum difference of 2 Da excluded all peptide sequences from analysis, confirming complete labeling in both experiments (Table 2). On the other hand, a threshold value of 4 Da resulted in the exclusion of several hundred peptides (Table 2). This suggested that upstream processing software did not exclude internal lysine and arginine residues from the identification of peptide pairs. While this filter is useful to define peptide pairs by the difference in mass between them, this filter can also be considered a second independent validation of chromatographic co-elution. The set difference in mass between pairs confirms that both peptides are present in the same spectrum, which is a result of chromatographic co-elution and peptide sequence identity.

The software was designed so that the user defines expected differences in mass between heavy and light peptides. To accommodate use with any labeling scheme, a list of possible values can be used to define peptide pairs at the GUI interface. Although the data used in these experiments were generated with high mass accuracy, so that an input threshold of 4.008 Da would be appropriate, the software was also designed for use with data from instruments that provide less confidence. Therefore, any difference in mass that was within 0.1 Da of the input value is retained. Because much of the work in quantitating peptide pairs was performed by upstream software, a strict threshold did not provide any additional benefit in this analysis.
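A minimal Python sketch of the delta mass check is given below. It assumes the filter compares the observed heavy-minus-light mass difference against each user-supplied expected value with the 0.1 Da tolerance mentioned above; the function name, the argument names, and the example masses are hypothetical, and the 2.004 Da entry is simply half of the 4.008 Da value quoted in the text, used for illustration.

```python
def passes_delta_mass(light_mass, heavy_mass, expected_deltas, tolerance=0.1):
    """Retain a pair if its mass difference is within `tolerance` (Da)
    of any expected difference supplied by the user."""
    delta = abs(heavy_mass - light_mass)
    return any(abs(delta - expected) <= tolerance for expected in expected_deltas)


# Illustrative expected differences for single and double 18O incorporation.
expected = [2.004, 4.008]
ok_pair = passes_delta_mass(1000.50, 1004.51, expected)   # about 4 Da apart
bad_pair = passes_delta_mass(1000.50, 1008.52, expected)  # about 8 Da apart
```

Accepting a list of expected differences rather than a single value is what allows the same check to serve other labeling schemes.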

A second filter was imposed to limit inaccuracy in quantitative analysis between peptide pairs. In our analysis we noticed that the relative areas of heavy and light peptide ions sometimes reached extreme values nearing 1:100 and 1:1000. These outliers significantly broadened the standard deviation of the relative peptide ratio mean at the Gene level, reducing confidence in the quantitative analysis. To exclude these values from the analysis, a quantitative threshold was imposed on peptide pairs that established a minimum relative ratio between peptide ion chromatogram areas. Algebraically, this also implies that those ratios must be less than the reciprocal of the threshold value, providing a limit on maximum values for fold-change ratios.
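The reciprocal relationship between the minimum and maximum ratio can be shown with a short Python sketch. The function name, the heavy-over-light direction of the ratio, and the inclusive treatment of the boundaries are assumptions made for illustration; a threshold of 0.05 corresponds to a 20-fold limit in either direction.

```python
def passes_ratio_cutoff(light_area, heavy_area, min_ratio):
    """Keep a pair only if min_ratio <= heavy/light <= 1/min_ratio."""
    if light_area <= 0 or heavy_area <= 0:
        return False  # a ratio cannot be formed from a missing area
    ratio = heavy_area / light_area
    return min_ratio <= ratio <= 1.0 / min_ratio


# With min_ratio = 0.05, ratios outside [0.05, 20] are rejected.
within = passes_ratio_cutoff(100.0, 1000.0, 0.05)    # ratio 10
too_high = passes_ratio_cutoff(100.0, 5000.0, 0.05)  # ratio 50
too_low = passes_ratio_cutoff(5000.0, 100.0, 0.05)   # ratio 0.02
```

Because the bounds are reciprocal, the filter treats an increase and a decrease of the same magnitude symmetrically.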

The ratio cutoff filter removed several hundred peptide sequences in our data sets that demonstrated a greater than 20-fold difference in relative ion areas. Interestingly, around half of the total peptide sequences demonstrated a greater than 2-fold difference in relative ion areas (Table 2). This filter was valuable for investigating the distribution of relative differences in peptide abundances across the entire experiment and results after exclusion of extreme pairs. Removal of these outliers with extreme values increased the precision and accuracy of relative peptide ion quantitation and the resulting analysis at the Gene level (MacCoss and Wu 2007).

Filters 5 and 6: Gene Assignments

Peptide sequences were organized by the gene name of proteins to which they were assigned in upstream analyses. Organization by gene name provided the basis for two additional filters limiting the inclusion of peptides in quantitative analyses. The first filter took advantage of molecular weight boundaries defined by the gel slice from which a protein was excised. The filter was intended to limit analysis at the Gene level to peptides that were digested by trypsin and were not the result of protein degradation. It was also intended to prevent oversampling of contaminating proteins that were present in every gel slice.

Identified proteins were referenced against the Uniprot database andmolecular weight information was matched against fraction definitionsprovided by the user. The filter established a percentage of error thatwould be tolerated for the protein molecular weight, as determined byUniProt. Peptides from proteins that were identified in appropriatefractions, with added or subtracted error, were retained. The limitationof this filter was that the molecular weight noted by UniProt pertainsto the protein precursor and not to the active form of expressedproteins. Despite this limitation, a broad error allowing two times, or100% of, the expected protein molecular weight removed over six hundredpeptides from the total analysis.

A person skilled in the art would recognize that the removal ofNH2-terminal protein sequences and the addition of variouspost-translational modifications would not be expected to affect theexpected molecular weight by over 100%. Therefore, this filter wasvaluable for the determination of relative protein degradation andcontamination per experiment.
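One plausible reading of the molecular weight filter is that the expected weight, widened by the tolerated percentage of error, must overlap the weight range of the gel slice. The Python sketch below follows that reading; the function name, the kDa units, and the overlap rule itself are assumptions made for illustration rather than details stated in the text.

```python
def within_fraction(expected_mw_kda, slice_low_kda, slice_high_kda, tolerance=1.0):
    """Return True when the tolerated weight window overlaps the gel slice.

    tolerance is a fraction of the expected weight (1.0 means 100%), so the
    default window runs from 0 to twice the expected molecular weight.
    """
    window_low = expected_mw_kda * (1.0 - tolerance)
    window_high = expected_mw_kda * (1.0 + tolerance)
    # Two ranges overlap when each starts before the other ends.
    return window_low <= slice_high_kda and window_high >= slice_low_kda
```

Under this reading, a 50 kDa protein found in a 150 to 250 kDa slice is rejected even at 100% tolerance, while a stricter tolerance rejects progressively closer mismatches.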

The final filter established that every protein included in the final population was identified by a minimum number of peptide sequences. For example, the quantitative analysis of a protein from one relative peptide ratio between samples cannot be counted with confidence and that protein should be excluded. The advantage of this filter is that it can be used to limit the analysis to proteins that were sampled at high frequencies, and therefore identified and analyzed quantitatively with high confidence.
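The minimum-peptide filter amounts to grouping peptide sequences by gene and discarding genes with too few unique sequences. A minimal Python sketch follows; the function name, the tuple layout, and the gene and sequence labels in the example are hypothetical.

```python
from collections import defaultdict


def filter_by_min_peptides(assignments, min_peptides=2):
    """Keep genes supported by at least min_peptides unique sequences.

    assignments is an iterable of (gene_name, peptide_sequence) tuples.
    """
    sequences_by_gene = defaultdict(set)
    for gene, sequence in assignments:
        sequences_by_gene[gene].add(sequence)
    return {
        gene: sequences
        for gene, sequences in sequences_by_gene.items()
        if len(sequences) >= min_peptides
    }


# Made-up example: GENE_B is seen twice, but only through one sequence.
example = [
    ("GENE_A", "PEPTIDEK"),
    ("GENE_A", "SAMPLER"),
    ("GENE_B", "PEPTIDEK"),
    ("GENE_B", "PEPTIDEK"),
]
retained = filter_by_min_peptides(example)
```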

Statistics and Heuristics to Guide the Selection of Important Genes

We used descriptive statistics for a preliminary analysis of the filtered set of peptide sequence ratios. We examined the population of all peptide sequences in each experiment, for each Keyword grouping, and for each Gene grouping to get a global view of abundance distributions. The average abundance for peptide heavy and light ratios was close to equal at the Experiment level, confirming equal total protein abundance between samples. Nevertheless, each peptide ratio was normalized by the median of all labeled peptide ratios to ensure that abundance comparisons were based on a stable baseline.
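Median normalization can be illustrated as follows. This is a sketch only; the function name and the example ratios are assumptions, and the sketch assumes the ratios have already passed the filters described above.

```python
from statistics import median


def normalize_by_median(ratios):
    """Divide every peptide ratio by the median of the whole population."""
    center = median(ratios)
    if center == 0:
        raise ValueError("median ratio is zero; cannot normalize")
    return [ratio / center for ratio in ratios]


# Made-up ratios from a sample pair that was mixed slightly unevenly.
normalized = normalize_by_median([1.0, 2.0, 4.0])
```

After normalization the population median is 1, so a ratio above or below 1 reads directly as a relative increase or decrease.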

We used the perspective provided by the total population of labeled peptide ratios to describe patterns and trends in the data. The standard deviation and standard error of the mean of all labeled peptide ratios at the Experiment level highlighted Genes that were very different in abundance from the majority of the population. We also incorporated a one-sample t-test that compared the mean of each Gene or Keyword grouping against the experimental mean by using the number of peptide sequences in each grouping to determine the degrees of freedom. By rapidly identifying groups that were significantly different from the experiment mean we selected for Genes and Keywords that were most affected by LPS treatment.
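The one-sample t statistic for a Gene or Keyword grouping can be sketched as below. The function name and the example values are hypothetical. The sketch stops at the statistic and its degrees of freedom, since converting them to a p-value requires a t-distribution table or a statistics library that is not assumed here.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(group_ratios, experiment_mean):
    """Return (t statistic, degrees of freedom) for one grouping.

    The grouping mean is compared against the experiment-wide mean, and the
    degrees of freedom are the number of peptide ratios minus one.
    """
    n = len(group_ratios)
    if n < 2:
        raise ValueError("at least two peptide ratios are needed")
    standard_error = stdev(group_ratios) / sqrt(n)
    if standard_error == 0:
        raise ValueError("ratios are identical; t is undefined")
    t_statistic = (mean(group_ratios) - experiment_mean) / standard_error
    return t_statistic, n - 1


# Made-up grouping of three normalized peptide ratios.
t_value, degrees_of_freedom = one_sample_t([1.2, 1.4, 1.6], 1.0)
```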

Significantly, this invention uses a unique approach for selecting important Genes and Keywords. By contrast, the prior methods that have been developed to determine which Genes merit investigation do not have specified standards for selection, so that an investigator may not be able to duplicate the results of another investigator. This is because prior methods fail to hold constant the criteria for Gene selection between any two investigators. Because this variable is not controlled, the results of laboratory tests are not consistent between investigators.

As a significant improvement, this invention implements a set of rules for statistics and heuristics that are applied uniformly to data sets to determine significance a priori for changes in abundance based on the total population in an experiment. For example, ASAPRatio uses a log-transformed fitted normal, justified by the central limit theorem for large sample sizes, and an error function to generate p-values (Li, Zhang et al. 2003). By contrast, GOMiner uses Fisher's exact test and q-values to handle small sample sizes (Zeeberg, Feng et al. 2003).

Small sample sizes are the most common condition for Gene groupings of proteomic data acquired by mass spectrometry. Accordingly, the t-test is appropriate to adjust the confidence level to reflect the sample size. The t-test has heuristic value and transparency for use as the standard metric. However, existing calculation methods merely encourage the heuristic use of statistical tests. Although the software presented here can be used heuristically, we found that by including sample sizes and normality data, the t-test can be applied appropriately for a strict and rigorous test of significance.

The t-test is widely used to show significance within data sets but relies on normality assumptions. Previous use of the t-test for data obtained by mass spectrometry is associated with several challenges. For example, implicit assumptions that support the use of the t-test for genomic microarrays do not carry over consistently to proteomic data obtained by mass spectrometry. Therefore, the t-test has often been used with caveats that strongly limit its value. To enable a rigorous use of the t-test for identifying Gene and Keyword groupings that are significantly different from the majority of the population, we implemented a set of statistics related to descriptions of normality.

Normality was described heuristically by values calculated for skewness and kurtosis within populations of peptide ratios. As a more formal test for normality, we implemented D'Agostino-Pearson Omnibus K2 scores and D'Agostino p-values (D'Agostino, Belanger et al. 1990). From these statistics, we found that the commonly used approximation of data normality by log-transformation of fold-change ratios for peptide sequence abundances is not accurate.
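The heuristic part of the normality description, skewness and kurtosis, can be computed from central moments. The Python sketch below uses the simple moment-based definitions (population moments, with kurtosis reported as excess kurtosis); the function name and the choice of these particular estimators are assumptions, and the omnibus K2 test itself is not reproduced here. Statistical libraries provide it, for example `scipy.stats.normaltest` in SciPy.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments.

    A normal distribution gives approximately 0 for both quantities.
    """
    n = len(values)
    if n < 3:
        raise ValueError("at least three values are needed")
    center = sum(values) / n
    m2 = sum((v - center) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("values are identical; moments are undefined")
    m3 = sum((v - center) ** 3 for v in values) / n
    m4 = sum((v - center) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# A symmetric made-up sample has zero skewness.
skew, excess_kurtosis = skewness_and_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
```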

A second method that may have been applied to the handling of mass spectrometry data, following its successful use in microarrays, is the calculation of relative abundance from the direct ratio between labeled peptides. We used our software to compare Gene level log-transformations of mean fold-change ratios calculated from peptide heavy to light ratios and the mean ratios of their reciprocals.

We also calculated a new metric for fold-change that we found to be more precise. The new metric calculates the ratio of heavy or light-labeled peptides in relation to the total area of both peptide ions over chromatographic time. For every peptide pair, the fold-change ratio is between 0 and 1. This simplifies computations and facilitates comparisons within and between peptide pairs. This metric is convenient for comparing treated and control samples because abundance ratios for each share the same denominator and are not defined by a dependent principle.
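The metric relative to the total pair area can be illustrated with a short Python sketch. The function name and the example areas are hypothetical; the sketch only shows the arithmetic described above.

```python
def fraction_of_total(heavy_area, light_area):
    """Return (heavy:total, light:total) for one labeled peptide pair."""
    total = heavy_area + light_area
    if total <= 0:
        raise ValueError("total ion area must be positive")
    # Both fractions share the same denominator and lie between 0 and 1.
    return heavy_area / total, light_area / total


# A made-up pair with a direct heavy:light ratio of 3 maps to 0.75 and 0.25.
heavy_fraction, light_fraction = fraction_of_total(3.0, 1.0)
```

For a direct ratio r = heavy/light, the heavy fraction equals r / (1 + r), so an unbounded direct ratio is mapped onto a common 0 to 1 scale.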

In contrast, the mean of peptide fold-change ratios and their reciprocals are independent of one another. Most importantly, calculation of the fold-change ratio by relation to a total corrects for a sloping baseline within the total data set. Sloping baselines prevent accurate comparisons between peptide ratios in a data set and severely affect the precision of every measurement. This correction is required because the precision of measurements in mass spectrometry is not directly related to the number of ions acquired or the observed ratio, as might be the case for colorimetric ratios obtained in microarray experiments (MacCoss, Wu et al. 2002). The new metric defines an internal scale from 0% to 100% that is common to every peptide ratio. Indeed, by using this new metric for fold-change, we found that standard deviations for mean peptide ratios at the Gene level were drastically reduced.

Using peptide ratios in relation to the total pair area resulted in a second improvement in data distributions. Fold-change ratios calculated from the total peptide pair area approached normality without the log-transformation that is commonly used to correct for skewness in direct ratios of peptide ion areas. For the calculation of fold-change from the direct ratio between heavy and light-labeled peptides, the left tail is truncated at 0 while the right tail can be arbitrarily long, introducing skewness. By using a ratio determined by the total paired peptide ion area, the range of values is restricted to between 0 and 1. In effect, both tails are truncated, thereby minimizing the standard deviation of the total population. This also has the effect of producing low kurtosis scores that increase the power of one-sample t-tests (Reineke, Baggett et al. 2003).

In the end, log-transformation of fold-change ratios at the peptide level generates populations that only appear more normal than the untransformed data. To illustrate this point, we compared skewness and kurtosis values for Log2-transformed heavy:light ratios and heavy:total ratios in our data sets. We found that based on skewness and kurtosis alone, heavy:total ratios showed an improved distribution. By avoiding log-transformations and approaching normality with the calculation of a consistent fold-change metric, we were able to avoid the confusing circumstance of having to transform data and their statistics. For example, investigators would not have to specify that all Gene means from log-transformed peptide ratios are geometric means.

Categorical Analysis

While gene names provide a natural first grouping for peptide sequences, they have some drawbacks. Statistically, they can result in very small sample sizes. Perhaps more importantly from a biological standpoint, they are not the appropriate unit for the analysis of global changes at the cellular level. One primary example is the determination of experimental treatment effects on a network, such as a signaling pathway. In this case, the analysis of changes in abundance at the protein level and statistical methods to combine results in a meaningful way can be difficult.

Several descriptive tools have been developed to categorize lists of genes by function and report statistical scores for the enrichment of keyword categories using Gene identification alone, such as DAVID (Huang, Sherman et al. 2007). In addition, programs such as Scaffold (Searle 2010) and Cytoscape (Shannon, Markiel et al. 2003) generate pie charts showing the distribution of keyword membership and interaction networks based on Genes identified in each experiment.

As a significant improvement to these descriptive tools, this invention includes a quantitative method for investigating the relative abundance of keywords in experimental data. Existing keyword dictionaries, included with UniProt and Gene Ontology, provide gene names by category, including molecular function, cellular compartment, post-translational modification and associated ligand. This invention stores this information locally in a database and uses indexed tables and optimized queries to organize protein and peptide sequence data by keyword. This invention allows efficient comparisons from the large database, so that results are obtained in an average of thirty seconds. Because each keyword captured multiple genes, groupings by keyword generated sample sizes that were more conducive to hypothesis testing.

The richness of the UniProt and Gene Ontology keywords provides a widely accepted set of categories for classification. Groupings by keyword allow for initial comparisons between experiments. For example, instead of requiring the same protein to be sampled in each experiment, proteins within the same keyword category can be observed and grouped for a summary effect. Importantly, classification of experiment data by keyword provides insights that are not evident when the classification is by gene name. This invention allows critical protein functional relationships to become evident based on quantitative analysis of relative peptide abundance, statistical measures of significance, and categorical keyword groupings.

Keyword categorization provided a second layer of abstraction for proteomic analysis: the degree of keyword overlap within and between experiments. For any two keywords, the number of shared Genes divided by the total number of unique genes for both keywords gave a percentage overlap value. Keyword overlap provided an intuitive means for detecting networks of Genes that serve multiple functions within a cell. This novel invention is particularly useful for determination of complex patterns of protein interaction, such as prediction of multiple targets within a signaling network following pharmaceutical stimulation or inhibition.
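The percentage overlap between two keywords, as described above, is the size of the intersection of their gene sets divided by the size of their union. A minimal Python sketch follows; the function name and the placeholder gene labels are hypothetical.

```python
def keyword_overlap(genes_a, genes_b):
    """Return the percentage of unique genes shared by two keywords."""
    set_a, set_b = set(genes_a), set(genes_b)
    all_genes = set_a | set_b
    if not all_genes:
        return 0.0  # two empty keywords share nothing
    return 100.0 * len(set_a & set_b) / len(all_genes)


# Two made-up keywords sharing two of four unique genes.
overlap = keyword_overlap({"G1", "G2", "G3"}, {"G2", "G3", "G4"})
```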

CONCLUSION

For the preferred embodiment, the novel method is used to analyze the data produced by mass spectrometry equipment in combination with other specialized equipment, such as LC-MS/MS (liquid chromatography tandem mass spectrometry). This novel method offers unique advantages over prior analytical methods. The unique benefits are accurate protein identification, accurate measurement of protein function and activity, and measurement of protein interactions between each pair of proteins in a sample. With sequential tests over time, the interactive functional relationships between proteins can be derived. The analytical results are displayed in a manner that allows disclosure of the underlying structure of the complex data.

The preferred embodiment of the method is a computer software program, written in Java, which can be used in computers using a standard operating system, such as Windows or Macintosh. The preferred embodiment is used to calculate the relative difference in protein and keyword abundance from ratios of labeled peptides between control and treated cells.

The relative abundance is calculated from the mean relative abundance of groups of peptides. Although this method could be used based on data from entire proteins, peptides are the widely-used standard experimentally observable unit for mass spectrometry investigations of proteins (Kuster, Schirle et al. 2005).

Peptides were excluded from analysis based on specified criteria to increase confidence in the quantitation of relative mean protein abundance. These specified criteria include the requirement for a minimum of 30 MS scans per peptide, a temporal overlap between MS scans of labeled peptide pairs, and a minimum of two unique peptides per protein.
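The scan-count and temporal-overlap criteria can be sketched as below. This is an illustration under several assumptions that the text does not specify: each labeled peptide is represented by an inclusive (first_scan, last_scan) range, scans within that range are contiguous, and the 30-scan minimum is applied to each member of the pair. The function name is hypothetical.

```python
def passes_scan_filters(heavy_scans, light_scans, min_scans=30):
    """Check scan count and temporal overlap for one labeled peptide pair.

    heavy_scans and light_scans are inclusive (first_scan, last_scan) tuples.
    """
    for first, last in (heavy_scans, light_scans):
        if last - first + 1 < min_scans:
            return False  # too few scans to integrate reliably
    overlap_start = max(heavy_scans[0], light_scans[0])
    overlap_end = min(heavy_scans[1], light_scans[1])
    # The pair must share at least one scan in chromatographic time.
    return overlap_start <= overlap_end
```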

To exclude quantitation from degraded protein fragments, a maximum difference of 30% was required between the expected protein molecular weight, as determined by UniProt (http://www.uniprot.org), and the molecular weight boundaries of the gel slice from which the protein was derived. Peptides with very large or very small ratios were generally indicative of false associations within CPAS and were excluded. To place greater statistical value on unique peptides within a protein, duplicate peptides were removed by retaining only those with the best PeptideProphet scores. Deviation from 1:1 mixing between the control and treated samples was corrected by protein normalization using the population median.
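Removal of duplicate peptides by score can be sketched as a single pass that keeps the best-scoring observation of each sequence within a gene. The record layout, the function name, and the assumption that a higher PeptideProphet score is better (it is reported as a probability) are illustrative choices, and the example records are made up.

```python
def best_scoring_peptides(identifications):
    """Keep the highest-scoring observation of each (gene, sequence) pair.

    identifications is an iterable of (gene, sequence, score, ratio) tuples,
    where a higher score is assumed to be better.
    """
    best = {}
    for gene, sequence, score, ratio in identifications:
        key = (gene, sequence)
        if key not in best or score > best[key][0]:
            best[key] = (score, ratio)
    return best


# Made-up records: the same sequence observed twice for one gene.
records = [
    ("GENE_A", "PEPTIDEK", 0.91, 1.10),
    ("GENE_A", "PEPTIDEK", 0.99, 1.25),
    ("GENE_A", "SAMPLER", 0.95, 0.80),
]
unique = best_scoring_peptides(records)
```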

The data from the peptide measurements are compared to detailed protein information from a widely-accepted database of protein characteristics. Test data are rejected based on a series of standard screening criteria which prevent erroneous data from being considered in final calculations. The criteria include six fundamental standards, such as exclusion of data that are not consistent with widely-accepted biochemical principles. Based on the refined measurements, the results are sorted by gene and by Keyword classification, to show fundamental relationships for each protein.

The data analysis results are unique in clearly showing relationships that were previously incomprehensible due to the complexity of the underlying data. As a result, critical decisions can be made based on the meaningful details, such as the extent of protein-protein interactions, and the effects of inhibitors on kinase activity over time. In this way, improved decisions can be made with regard to unresolved issues as to protein activity and function, including signaling networks and subtle functions of the immune system.

Based on separate multiple measurements over time, the method provides the factual foundation that would support inference of causation based on the functional variation of peptide activity for any combination or permutation of any two pairs of peptides over time. The novel features of the invention are its error correction features, which reject erroneous measurements or calculations from the output of the laboratory equipment.

This invention interprets and screens the machine outputs, so that any variation or deviation from established criteria results in rejection of the sample measurement based on a sequential series of screening tests, in exactly the same manner for each peptide in the sample. Therefore, this novel method forces each measured sample to meet specific criteria as to verification of identification, accuracy of measurement for the amount of activity, and the effect of the inhibitor or other compound on the activity of the peptide. The accuracy of each measurement is verified through separate controls internal to each sample that are exactly matched as to all possible variables.

This invention allows custom modifications by a person skilled in the art. For example, in addition to the sets of keywords provided by UniProt and Gene Ontology, user-defined sets of keywords can be added to the database. Also, a person skilled in the art could add widely accepted tests for excluding outliers as part of the filtering process for the raw input and nonparametric statistics to give more avenues for investigating the data, especially when small sample sizes are under consideration.

This invention provides a standardized means of proteomic analysis. An accurate description of test conditions and calculation methods is necessary to allow accurate comparison between tests from different investigators. This invention describes important cellular changes by the interplay of patterns provided in complete analyses of proteomic data.

Proteomic experiments provide global observations that may isolate previously unknown proteins by their functional importance. However, by selectively reporting changes in a small group of proteins of interest, proteomics studies often fail to describe accurately the intracellular environment from which measurements were made. To assure accurate comparison of test results, a standard calculation method is required (Pedrioli, Eng et al. 2004).

This invention provides a major improvement for calculation of mass spectrometry test results. With this new method, an investigator can modify specific analytical criteria with a clear description, with specificity, as to each calculation element that was used for each test. This full description of test conditions allows a clear comparison of test results between investigators.

THE PRIMARY ADVANTAGES OF THE NEW METHOD

Mass spectrometry is the most common technique used for protein biomarker discovery in complex samples. Existing software is successful at providing statistical evidence that a given protein has been correctly identified from measurements of molecular mass and charge. Moving past a simple description, quantitative mass spectrometry is concerned with determining the difference in abundance for a given protein in a comparison of two samples. Current proteomics research goes beyond mere correct protein identification, to detailed analysis of significant changes in protein abundance. This requires robust, accurate, undistorted measurement of peptide characteristics across a wide variety of experimental settings and conditions. For the data to be meaningful, these measures have to be examined in a biological context.

The critical need for this sophisticated software is shown by recent reports that demonstrate inconsistent results and inaccurate measurements from current mass spectrometry equipment and software, despite careful laboratory controls. Although various types of mass spectrometry software have been used in proteomics for over a decade, the current emphasis on protein activity within a biological context results in the requirement for significant software improvements.

With the new method, mass spectrometry may be used to describe the significant functional relationships and activity for sampled proteins within the cell. With improved software, the horizons of systems biology can be expanded to include detailed information on protein signaling networks, to complement the expanding knowledge of genomics. This method demonstrates unique advantages toward this goal, such as: (a) accurate calculation of the relative abundance of sampled proteins and functional keyword categories, (b) multiple screening to reject errors, sorting by specified criteria, and specification of the relevant protein activity, and (c) display of significant relationships between sampled proteins and keywords, highlighting quantitative activity signatures specific to the experiment.

As a significant innovation, this new method addresses important problems in proteomic analysis, with significant improvements in accuracy, reliability, and information content. This new method provides detailed calculations and functional annotations, to replace prior manual calculations, and supports new avenues of investigation due to the ability to highlight functional trends in complex data.

This new method was developed to correct the errors found in current mass spectrometry software, such as lack of capability for accurate replication of the results of a published experiment, due to failure to adjust quantitation to the characteristics of different equipment and a lack of consistent analytical methods for independent scientists to evaluate the data.

This new method corrects the errors of inaccurate measurements with limited information content, significant identification errors, and high data dispersion. With prior methods, the identification is merely nominal, based on database association of the protein and gene name, without display of critical network relationships or the effects of changed test conditions.

Several key features of this new method are not available with other methods. One of the key features is heuristic filtering of data to accommodate the behavior of various mass spectrometry instruments and the inclusion of the investigator's judgment. The new method allows adjustments for machine-learning algorithms to optimize analytical settings for each particular experiment, thereby automating standardization procedures.

Another key advantage is that this new method allows use of a wide array of statistical analysis methods, including classical significance tests, to find meaningful changes in protein abundance based on the complete sampled population. Useful graphic displays of the data are included in the method, such as cluster analysis and heat maps, to demonstrate significant relationships within the data, and to highlight the correlation of protein activities in response to specific stimuli.

As a major advantage, this new method allows organization of quantitative data by recognized keyword categories to reveal changes in the abundance of functional groups and their degree of overlap. A keyword is a descriptive classification that is widely recognized and understood in the biological research field, as a summary of essential characteristics. The result is a cohesive and functional signature, as opposed to disparate biomarkers with unclear relationships or biological context. These aspects of the software will be developed further to include the generation of self-organizing maps and Bayesian or neural network models, which will identify new functionally significant relationships within the data.

The new method allows the investigator to continue to use the functional terminology of widely used existing databases, such as UniProt and Gene Ontology, to organize the observed data into units of statistical analysis based on biologically meaningful keywords. This gives researchers an easy way to look at the question of abundance in terms of cellular localization, function and disease instead of as lists of unrelated proteins devoid of a larger context. Using this as a starting point, the new method can also quantitatively describe relationships between categories, which may prove relevant for investigating non-specific interactions from therapeutic treatments.

Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following claims.

What is claimed is:
1. A method comprising: preparing a biological sample containing proteins; subjecting the sample to analysis by a mass spectrometer; receiving a raw data set from the mass spectrometer, the raw data set representing ion spectra corresponding to peptides or peptide fragments within the sample; identifying peptides, proteins, or gene names from the raw data set by use of a data processor and processing instructions executing thereon, the identifying including transforming the raw data set to a structured data set formatted for pattern recognition using the data processor, the transforming including filtering the raw data set based on chromatographic and elution characteristics of the sample, the identifying further including comparing portions of the structured data set to a protein or gene index thereby identifying the peptides, proteins, or gene names corresponding to the sample; implementing a set of logical rules to selectively filter the peptides, proteins, or gene names identified from the structured data set using filter parameters to isolate filtered peptides, proteins, or gene names; correlating the filtered peptides, proteins, or gene names with corresponding keywords by use of the data processor and the processing instructions executing thereon; calculating peptide abundance from the filtered peptides, proteins, or gene names by use of the data processor and the processing instructions executing thereon; calculating protein abundance from the filtered peptides, proteins, or gene names by use of the data processor and the processing instructions executing thereon; calculating keyword abundance from the filtered peptides, proteins, or gene names, and correlated keywords by use of the data processor and the processing instructions executing thereon; and generating, by use of the data processor and the processing instructions executing thereon, a keyword overlap score based on a correlation of the gene names corresponding to the filtered peptides and proteins and the keywords associated with the gene names corresponding to the filtered peptides and proteins.
2. The method of claim 1 further including generating, by use of the data processor and the processing instructions executing thereon, data and graphic pattern representations of peptide functions and interactions based on the calculated peptide abundance, the protein abundance, the keyword abundance, and the keyword overlap score for the filtered peptides, proteins, or gene names, the data and graphic pattern representations being generated for presentation on a display device, wherein the data and the graphic pattern representations identify, quantify, and describe the significant functional relationships for each protein within groups of proteins that are components of the biological sample.
3. The method of claim 1, wherein filtering the raw data set further comprising: filtering the raw data set using heuristic filtering of the raw data to accommodate the behavior of various different mass spectrometers.
4. The method of claim 1 wherein selectively filtering the peptides, proteins, or gene names using the filter parameters further comprising: filtering the peptides, proteins, or gene names according to a minimum duration for which the peptides, proteins, or gene names are observed in the mass spectrometer.
5. The method of claim 1 wherein selectively filtering the peptides, proteins, or gene names using the filter parameters further comprising: filtering the peptides, proteins, or gene names according to a difference in mass between isotopically labeled samples.
6. The method of claim 1 wherein selectively filtering the peptides, proteins, or gene names using the filter parameters further comprising: filtering the peptides, proteins, or gene names according to a minimum relative ratio between peptide ion chromatogram areas.
7. The method of claim 1 wherein selectively filtering the peptides, proteins, or gene names using the filter parameters further comprising: filtering the peptides, proteins, or gene names according to a percentage of error tolerated for a protein molecular weight.
8. The method of claim 1 wherein selectively filtering the peptides, proteins, or gene names using the filter parameters further comprising: filtering the peptides, proteins, or gene names according to whether a protein was identified by a minimum number of peptide sequences.
9. A system comprising: a mass spectrometer for analysis of a biological sample containing proteins; and a data processor with data processing instructions executing thereon, the data processing instructions being configured to cause the data processor to: receive a raw data set from the mass spectrometer, the raw data set representing ion spectra corresponding to peptides or peptide fragments within the sample; identify peptides, proteins, or gene names from the raw data set by use of the data processor and the processing instructions executing thereon, the identifying including transforming the raw data set to a structured data set formatted for pattern recognition using the data processor, the transforming including filtering the raw data set based on chromatographic and elution characteristics of the sample, the identifying further including comparing portions of the structured data set to a protein or gene index thereby identifying the peptides, proteins, or gene names corresponding to the sample; implement a set of logical rules to selectively filter the peptides, proteins, or gene names identified from the structured data set using filter parameters to isolate filtered peptides, proteins, or gene names; correlate the filtered peptides, proteins, or gene names with corresponding keywords by use of the data processor and the processing instructions executing thereon; calculate peptide abundance from the filtered peptides, proteins, or gene names by use of the data processor and the processing instructions executing thereon; calculate protein abundance from the filtered peptides, proteins, or gene names by use of the data processor and the processing instructions executing thereon; calculate keyword abundance from the filtered peptides, proteins, or gene names, and correlated keywords by use of the data processor and the processing instructions executing thereon; and generate, by use of the data processor and the processing instructions executing thereon, a keyword overlap score based on a correlation of the gene names corresponding to the filtered peptides and proteins and the keywords associated with the gene names corresponding to the filtered peptides and proteins.
10. The system of claim 9 wherein the data processing instructions being configured to cause the data processor to generate data and graphic pattern representations of peptide functions and interactions based on the calculated peptide abundance, the protein abundance, the keyword abundance, and the keyword overlap score for the filtered peptides, proteins, or gene names, the data and graphic pattern representations being generated for presentation on a display device, wherein the data and the graphic pattern representations identify, quantify, and describe the significant functional relationships for each protein within groups of proteins that are components of the biological sample.
11. The system of claim 9 wherein the data processing instructions being configured to cause the data processor to filter the raw data set using heuristic filtering of the raw data to accommodate the behavior of various different mass spectrometers.
12. The system of claim 9 wherein the data processing instructions being configured to cause the data processor to selectively filter the peptides, proteins, or gene names according to a minimum duration for which the peptides, proteins, or gene names are observed in the mass spectrometer.
13. The system of claim 9 wherein the data processing instructions being configured to cause the data processor to selectively filter the peptides, proteins, or gene names according to a difference in mass between isotopically labeled samples.
14. The system of claim 9 wherein the data processing instructions being configured to cause the data processor to selectively filter the peptides, proteins, or gene names according to a minimum relative ratio between peptide ion chromatogram areas.
15. The system of claim 9 wherein the data processing instructions being configured to cause the data processor to selectively filter the peptides, proteins, or gene names according to a percentage of error tolerated for a protein molecular weight.
16. The system of claim 9 wherein the data processing instructions being configured to cause the data processor to selectively filter the peptides, proteins, or gene names according to whether a protein was identified by a minimum number of peptide sequences.